Lake County
A Targeted Learning Framework for Estimating Restricted Mean Survival Time Difference using Pseudo-observations
A targeted learning (TL) framework is developed to estimate the difference in the restricted mean survival time (RMST) for a clinical trial with time-to-event outcomes. The approach starts by defining the target estimand as the RMST difference between investigational and control treatments. Next, an efficient estimation method is introduced: a targeted minimum loss estimator (TMLE) utilizing pseudo-observations. Moreover, a version of the copy reference (CR) approach is developed to perform a sensitivity analysis for right-censoring. The proposed TL framework is demonstrated using a real data application.
STREETS: A Novel Camera Network Dataset for Traffic Flow
In this paper, we introduce STREETS, a novel traffic flow dataset from publicly available web cameras in the suburbs of Chicago, IL. We seek to address the limitations of existing datasets in this area. Many such datasets lack a coherent traffic network graph to describe the relationship between sensors.
HalluVerse25: Fine-grained Multilingual Benchmark Dataset for LLM Hallucinations
Abdaljalil, Samir, Kurban, Hasan, Serpedin, Erchin
Large Language Models (LLMs) are increasingly used in various contexts, yet remain prone to generating non-factual content, commonly referred to as "hallucinations". The literature categorizes hallucinations into several types, including entity-level, relation-level, and sentence-level hallucinations. However, existing hallucination datasets often fail to capture fine-grained hallucinations in multilingual settings. In this work, we introduce HalluVerse25, a multilingual LLM hallucination dataset that categorizes fine-grained hallucinations in English, Arabic, and Turkish. Our dataset construction pipeline uses an LLM to inject hallucinations into factual biographical sentences, followed by a rigorous human annotation process to ensure data quality. We evaluate several LLMs on HalluVerse25, providing valuable insights into how proprietary models perform in detecting LLM-generated hallucinations across different contexts.
Achieving Operational Universality through a Turing Complete Chemputer
Gahler, Daniel, Thomas, Dean, Lach, Slawomir, Cronin, Leroy
The most fundamental abstraction underlying all modern computers is the Turing Machine, that is if any modern computer can simulate a Turing Machine, an equivalence which is called Turing completeness, it is theoretically possible to achieve any task that can be algorithmically described by executing a series of discrete unit operations. In chemistry, the ability to program chemical processes is demanding because it is hard to ensure that the process can be understood at a high level of abstraction, and then reduced to practice. Herein we exploit the concept of Turing completeness applied to robotic platforms for chemistry that can be used to synthesise complex molecules through unit operations that execute chemical processes using a chemically-aware programming language, XDL. We leverage the concept of computability by computers to synthesizability of chemical compounds by automated synthesis machines. The results of an interactive demonstration of Turing completeness using the colour gamut and conditional logic are presented and examples of chemical use-cases are discussed. Over 16.7 million combinations of Red, Green, Blue (RGB) colour space were binned into 5 discrete values and measured over 10 regions of interest (ROIs), affording 78 million possible states per step and served as a proxy for conceptual, chemical space exploration. This formal description establishes a formal framework in future chemical programming languages to ensure complex logic operations are expressed and executed correctly, with the possibility of error correction, in the automated and autonomous pursuit of increasingly complex molecules.
Optimal Survey Design for Private Mean Estimation
Chen, Yu-Wei, Pasupathy, Raghu, Awan, Jordan A.
This work identifies the first privacy-aware stratified sampling scheme that minimizes the variance for general private mean estimation under the Laplace, Discrete Laplace (DLap) and Truncated-Uniform-Laplace (TuLap) mechanisms within the framework of differential privacy (DP). We view stratified sampling as a subsampling operation, which amplifies the privacy guarantee; however, to have the same final privacy guarantee for each group, different nominal privacy budgets need to be used depending on the subsampling rate. Ignoring the effect of DP, traditional stratified sampling strategies risk significant variance inflation. We phrase our optimal survey design as an optimization problem, where we determine the optimal subsampling sizes for each group with the goal of minimizing the variance of the resulting estimator. We establish strong convexity of the variance objective, propose an efficient algorithm to identify the integer-optimal design, and offer insights on the structure of the optimal design.
Letters, Colors, and Words: Constructing the Ideal Building Blocks Set
Salazar, Ricardo, Jamshidi, Shahrzad
Define a building blocks set to be a collection of n cubes (each with six sides) where each side is assigned one letter and one color from a palette of m colors. We propose a novel problem of assigning letters and colors to each face so as to maximize the number of words one can spell from a chosen dataset that are either mono words, all letters have the same color, or rainbow words, all letters have unique colors. We explore this problem considering a chosen set of English words, up to six letters long, from a typical vocabulary of a US American 14 year old and explore the problem when n = 6 and m = 6, with the added restriction that each color appears exactly once on the cube. The problem is intractable, as the size of the solution space makes a brute force approach computationally infeasible. Therefore we aim to solve this problem using random search, simulated annealing, two distinct tree search approaches (greedy and best-first), and a genetic algorithm. To address this, we explore a range of optimization techniques: random search, simulated annealing, two distinct tree search methods (greedy and best-first), and a genetic algorithm. Additionally, we attempted to implement a reinforcement learning approach; however, the model failed to converge to viable solutions within the problem's constraints. Among these methods, the genetic algorithm delivered the best performance, achieving a total of 2846 mono and rainbow words.
Federated Discrete Denoising Diffusion Model for Molecular Generation with OpenFL
Ta, Kevin, Foley, Patrick, Thieme, Mattson, Pandey, Abhishek, Shah, Prashant
Generating unique molecules with biochemically desired properties to serve as viable drug candidates is a difficult task that requires specialized domain expertise. In recent years, diffusion models have shown promising results in accelerating the drug design process through AI-driven molecular generation. However, training these models requires massive amounts of data, which are often isolated in proprietary silos. OpenFL is a federated learning framework that enables privacy-preserving collaborative training across these decentralized data sites. In this work, we present a federated discrete denoising diffusion model that was trained using OpenFL. The federated model achieves comparable performance with a model trained on centralized data when evaluating the uniqueness and validity of the generated molecules. This demonstrates the utility of federated learning in the drug design process. OpenFL is available at: https://github.com/securefederatedai/openfl
Retrieval, Reasoning, Re-ranking: A Context-Enriched Framework for Knowledge Graph Completion
Li, Muzhi, Yang, Cehao, Xu, Chengjin, Jiang, Xuhui, Qi, Yiyan, Guo, Jian, Leung, Ho-fung, King, Irwin
The Knowledge Graph Completion~(KGC) task aims to infer the missing entity from an incomplete triple. Existing embedding-based methods rely solely on triples in the KG, which is vulnerable to specious relation patterns and long-tail entities. On the other hand, text-based methods struggle with the semantic gap between KG triples and natural language. Apart from triples, entity contexts (e.g., labels, descriptions, aliases) also play a significant role in augmenting KGs. To address these limitations, we propose KGR3, a context-enriched framework for KGC. KGR3 is composed of three modules. Firstly, the Retrieval module gathers supporting triples from the KG, collects plausible candidate answers from a base embedding model, and retrieves context for each related entity. Then, the Reasoning module employs a large language model to generate potential answers for each query triple. Finally, the Re-ranking module combines candidate answers from the two modules mentioned above, and fine-tunes an LLM to provide the best answer. Extensive experiments on widely used datasets demonstrate that KGR3 consistently improves various KGC methods. Specifically, the best variant of KGR3 achieves absolute Hits@1 improvements of 12.3% and 5.6% on the FB15k237 and WN18RR datasets.
Characteristics of Political Misinformation Over the Past Decade
Although misinformation tends to spread online, it can have serious real-world consequences. In order to develop automated tools to detect and mitigate the impact of misinformation, researchers must leverage algorithms that can adapt to the modality (text, images and video), the source, and the content of the false information. However, these characteristics tend to change dynamically across time, making it challenging to develop robust algorithms to fight misinformation spread. Therefore, this paper uses natural language processing to find common characteristics of political misinformation over a twelve year period. The results show that misinformation has increased dramatically in recent years and that it has increasingly started to be shared from sources with primary information modalities of text and images (e.g., Facebook and Instagram), although video sharing sources containing misinformation are starting to increase (e.g., TikTok). Moreover, it was discovered that statements expressing misinformation contain more negative sentiment than accurate information. However, the sentiment associated with both accurate and inaccurate information has trended downward, indicating a generally more negative tone in political statements across time. Finally, recurring misinformation categories were uncovered that occur over multiple years, which may imply that people tend to share inaccurate statements around information they fear or don't understand (Science and Medicine, Crime, Religion), impacts them directly (Policy, Election Integrity, Economic) or Public Figures who are salient in their daily lives. Together, it is hoped that these insights will assist researchers in developing algorithms that are temporally invariant and capable of detecting and mitigating misinformation across time.
STREETS: A Novel Camera Network Dataset for Traffic Flow
In this paper, we introduce STREETS, a novel traffic flow dataset from publicly available web cameras in the suburbs of Chicago, IL. We seek to address the limitations of existing datasets in this area. Many such datasets lack a coherent traffic network graph to describe the relationship between sensors. The datasets that do provide a graph depict traffic flow in urban population centers or highway systems and use costly sensors like induction loops. These contexts differ from that of a suburban traffic body.